Fitting curve to data

Within this notebook we do some data analytics on historical data to feed some real numbers into the model. Since we assume the consumer data to be resemble a sinus, due to the fact that demand is seasonal, we will focus on fitting data to this kind of curve.


In [69]:
import numpy as np
from scipy.optimize import leastsq
import pylab as plt
import pandas as pd

N = 1000 # number of data points
t = np.linspace(0, 4*np.pi, N)
data = 3.0*np.sin(t+0.001) + 0.5 + np.random.randn(N) # create artificial data with noise

guess_mean = np.mean(data)
guess_std = 3*np.std(data)/(2**0.5)
guess_phase = 0

# we'll use this to plot our first estimate. This might already be good enough for you
data_first_guess = guess_std*np.sin(t+guess_phase) + guess_mean

# Define the function to optimize, in this case, we want to minimize the difference
# between the actual data and our "guessed" parameters
optimize_func = lambda x: x[0]*np.sin(t+x[1]) + x[2] - data
est_std, est_phase, est_mean = leastsq(optimize_func, [guess_std, guess_phase, guess_mean])[0]

# recreate the fitted curve using the optimized parameters
data_fit = est_std*np.sin(t+est_phase) + est_mean

plt.plot(data, '.')
plt.plot(data_fit, label='after fitting')
plt.plot(data_first_guess, label='first guess')
plt.legend()
plt.show()


import data for our model

This is data imported from statline CBS webportal.


In [70]:
importfile = 'CBS Statline Gas Usage.xlsx'
df = pd.read_excel(importfile, sheetname='Month', skiprows=1)
df.drop(['Onderwerpen_1', 'Onderwerpen_2', 'Perioden'], axis=1, inplace=True)

df


Out[70]:
Onderwerpen_3 2010 januari 2010 februari 2010 maart 2010 april 2010 mei 2010 juni 2010 juli 2010 augustus 2010 september ... 2016 maart 2016 april 2016 mei 2016 juni 2016 juli 2016 augustus 2016 september 2016 oktober 2016 november 2016 december
0 Totaal verbruik 6917.000000 5834.000000 5075.000000 3724.0000 3538.0000 2658.000 2581.000000 2599.000000 3024.000000 ... 4331.000000 3153.000000 2284.000000 2064.0 2086.0 2170.0 2274.0 3359.0 4301.0 4770.0
1 Totaal via het hoofdtransportnet 2638.000000 2290.000000 2381.000000 2056.0000 2027.0000 1844.000 1927.000000 1837.000000 1936.000000 ... 1769.000000 1498.000000 1358.000000 1384.0 1468.0 1552.0 1649.0 1800.0 1845.0 1977.0
2 Elektriciteitscentrales 1066.000000 941.000000 965.000000 841.0000 742.0000 673.000 698.000000 668.000000 729.000000 ... 569.000000 436.000000 368.000000 475.0 520.0 576.0 659.0 697.0 785.0 794.0
3 Overige verbruikers 1572.000000 1349.000000 1416.000000 1215.0000 1285.0000 1171.000 1229.000000 1169.000000 1207.000000 ... 1200.000000 1062.000000 990.000000 909.0 948.0 976.0 990.0 1103.0 1060.0 1183.0
4 Via regionale netten 4218.000000 3490.000000 2636.000000 1614.0000 1458.0000 763.000 603.000000 709.000000 1042.000000 ... 2494.000000 1592.000000 859.000000 632.0 561.0 559.0 575.0 1494.0 2392.0 2732.0
5 Verbruik bij de winning 61.000000 54.000000 58.000000 54.0000 53.0000 51.000 51.000000 53.000000 46.000000 ... 68.000000 63.000000 67.000000 48.0 57.0 59.0 50.0 65.0 64.0 61.0
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN 14.644737 13.947368 11.931739 13.5475 16.8825 19.275 19.497619 18.188636 18.980952 ... 12.277143 12.100952 12.725556 NaN NaN NaN NaN NaN NaN NaN

9 rows × 85 columns


In [71]:
# transpose
df = df.transpose()

In [72]:
new_header = df.iloc[0]
df = df[1:]
df.rename(columns = new_header, inplace=True)

In [127]:
#df.drop(['nan'], axis=0, inplace=True)
df


Out[127]:
Totaal verbruik Totaal via het hoofdtransportnet Elektriciteitscentrales Overige verbruikers Via regionale netten Verbruik bij de winning nan nan nan
2010 januari 6917 2638 1066 1572 4218 61 NaN NaN 14.6447
2010 februari 5834 2290 941 1349 3490 54 NaN NaN 13.9474
2010 maart 5075 2381 965 1416 2636 58 NaN NaN 11.9317
2010 april 3724 2056 841 1215 1614 54 NaN NaN 13.5475
2010 mei 3538 2027 742 1285 1458 53 NaN NaN 16.8825
2010 juni 2658 1844 673 1171 763 51 NaN NaN 19.275
2010 juli 2581 1927 698 1229 603 51 NaN NaN 19.4976
2010 augustus 2599 1837 668 1169 709 53 NaN NaN 18.1886
2010 september 3024 1936 729 1207 1042 46 NaN NaN 18.981
2010 oktober 4147 2345 983 1362 1742 60 NaN NaN 18.5762
2010 november 5017 2315 944 1371 2632 70 NaN NaN 19.4905
2010 december 6868 2499 994 1505 4301 68 NaN NaN 24.2864
2011 januari 5818 2304 804 1500 3453 61 NaN NaN 22.5125
2011 februari 5007 2029 728 1301 2921 57 NaN NaN 21.7842
2011 maart 4704 2129 740 1389 2515 60 NaN NaN 23.7587
2011 april 3136 1922 712 1210 1160 54 NaN NaN 22.9139
2011 mei 2981 2008 745 1263 917 56 NaN NaN 22.95
2011 juni 2549 1754 576 1178 742 53 NaN NaN 22.6295
2011 juli 2643 1838 595 1243 746 59 NaN NaN 21.5025
2011 augustus 2702 1933 646 1287 714 55 NaN NaN 21.9035
2011 september 2768 1881 649 1232 827 60 NaN NaN 23.1024
2011 oktober 3601 2076 748 1328 1468 57 NaN NaN 22.4943
2011 november 4479 2211 891 1320 2203 65 NaN NaN 23.7976
2011 december 4989 2143 962 1181 2781 65 NaN NaN 22.4048
2012 januari 5246 2076 758 1318 3104 66 NaN NaN 22.0125
2012 februari 5920 2174 833 1341 3682 64 NaN NaN 26.2215
2012 maart 3911 1773 513 1260 2069 69 NaN NaN 23.9432
2012 april 3600 1685 560 1125 1855 60 NaN NaN 24.8658
2012 mei 2747 1642 493 1149 1036 69 NaN NaN 24.1865
2012 juni 2507 1604 481 1123 839 64 NaN NaN 23.8768
... ... ... ... ... ... ... ... ... ...
2014 juli 2185 1605 501 1104 531 49 NaN NaN 16.5168
2014 augustus 2380 1670 512 1158 646 64 NaN NaN 17.4645
2014 september 2469 1738 623 1115 672 59 NaN NaN 20.851
2014 oktober 2876 1714 606 1108 1101 61 NaN NaN 21.2626
2014 november 3660 1693 556 1137 1900 67 NaN NaN 22.9568
2014 december 4798 1882 583 1299 2851 65 NaN NaN 22.53
2015 januari 5014 1836 524 1312 3112 66 NaN NaN 19.7505
2015 februari 4625 1804 607 1197 2758 63 NaN NaN 22.4884
2015 maart 4118 1657 456 1201 2395 66 NaN NaN 21.8682
2015 april 2940 1383 324 1059 1497 60 NaN NaN 22.0405
2015 mei 2367 1324 249 1075 979 64 NaN NaN 20.5611
2015 juni 1974 1222 284 938 696 56 NaN NaN 20.4868
2015 juli 2063 1448 403 1045 551 64 NaN NaN 20.8595
2015 augustus 1994 1378 375 1003 550 66 NaN NaN 19.5671
2015 september 2457 1523 416 1107 879 55 NaN NaN 19.1695
2015 oktober 3356 1747 667 1080 1548 61 NaN NaN 18.2268
2015 november 3582 1710 656 1054 1808 64 NaN NaN 17.154
2015 december 3712 1652 566 1086 1991 69 NaN NaN 15.77
2016 januari 4795 1809 664 1145 2915 71 NaN NaN 13.8942
2016 februari 4339 1608 592 1016 2668 63 NaN NaN 12.441
2016 maart 4331 1769 569 1200 2494 68 NaN NaN 12.2771
2016 april 3153 1498 436 1062 1592 63 NaN NaN 12.101
2016 mei 2284 1358 368 990 859 67 NaN NaN 12.7256
2016 juni 2064 1384 475 909 632 48 NaN NaN NaN
2016 juli 2086 1468 520 948 561 57 NaN NaN NaN
2016 augustus 2170 1552 576 976 559 59 NaN NaN NaN
2016 september 2274 1649 659 990 575 50 NaN NaN NaN
2016 oktober 3359 1800 697 1103 1494 65 NaN NaN NaN
2016 november 4301 1845 785 1060 2392 64 NaN NaN NaN
2016 december 4770 1977 794 1183 2732 61 NaN NaN NaN

84 rows × 9 columns


In [134]:
x = range(len(df.index))
df['Via regionale netten'].plot(figsize=(18,5))
plt.xticks(x, df.index, rotation='vertical')
plt.show()


now let fit different consumer groups


In [111]:
#b = self.base_demand
#m = self.max_demand
#y = b + m * (.5 * (1 + np.cos((x/6)*np.pi)))
b = 603
m = 3615

N = 84 # number of data points
t = np.linspace(0, 83, N)
#data = b + m*(.5 * (1 + np.cos((t/6)*np.pi))) + 100*np.random.randn(N) # create artificial data with noise
data = np.array(df['Via regionale netten'].values, dtype=np.float64)

guess_mean = np.mean(data)
guess_std = 2*np.std(data)/(2**0.5)
guess_phase = 0

# we'll use this to plot our first estimate. This might already be good enough for you
data_first_guess = guess_mean + guess_std*(.5 * (1 + np.cos((t/6)*np.pi + guess_phase)))

# Define the function to optimize, in this case, we want to minimize the difference
# between the actual data and our "guessed" parameters
optimize_func = lambda x: x[0]*(.5 * (1 + np.cos((t/6)*np.pi+x[1]))) + x[2] - data
est_std, est_phase, est_mean = leastsq(optimize_func, [guess_std, guess_phase, guess_mean])[0]

# recreate the fitted curve using the optimized parameters
data_fit = est_mean + est_std*(.5 * (1 + np.cos((t/6)*np.pi + est_phase)))

plt.plot(data, '.')
plt.plot(data_fit, label='after fitting')
plt.plot(data_first_guess, label='first guess')
plt.legend()
plt.show()
print(est_std, est_phase, est_mean)


2695.90755615 -0.598507406084 380.296222028

In [113]:
#data = b + m*(.5 * (1 + np.cos((t/6)*np.pi))) + 100*np.random.randn(N) # create artificial data with noise
data = np.array(df['Elektriciteitscentrales'].values, dtype=np.float64)

guess_mean = np.mean(data)
guess_std = 3*np.std(data)/(2**0.5)
guess_phase = 0

# we'll use this to plot our first estimate. This might already be good enough for you
data_first_guess = guess_mean + guess_std*(.5 * (1 + np.cos((t/6)*np.pi + guess_phase)))

# Define the function to optimize, in this case, we want to minimize the difference
# between the actual data and our "guessed" parameters
optimize_func = lambda x: x[0]*(.5 * (1 + np.cos((t/6)*np.pi+x[1]))) + x[2] - data
est_std, est_phase, est_mean = leastsq(optimize_func, [guess_std, guess_phase, guess_mean])[0]

# recreate the fitted curve using the optimized parameters
data_fit = est_mean + est_std*(.5 * (1 + np.cos((t/6)*np.pi + est_phase)))

plt.plot(data, '.')
plt.plot(data_fit, label='after fitting')
plt.plot(data_first_guess, label='first guess')
plt.legend()
plt.show()
print(est_std, est_phase, est_mean)


261.178751524 0.423813583566 483.327290904

In [137]:
#data = b + m*(.5 * (1 + np.cos((t/6)*np.pi))) + 100*np.random.randn(N) # create artificial data with noise
data = np.array(df['Overige verbruikers'].values, dtype=np.float64)

guess_mean = np.mean(data)
guess_std = 3*np.std(data)/(2**0.5)
guess_phase = 0
guess_saving = .997

# we'll use this to plot our first estimate. This might already be good enough for you
data_first_guess = np.power(guess_saving,t) * (guess_mean + guess_std*(.5 * (1 + np.cos((t/6)*np.pi + guess_phase))))

# Define the function to optimize, in this case, we want to minimize the difference
# between the actual data and our "guessed" parameters
optimize_func = lambda x: x[0]*(.5 * (1 + np.cos((t/6)*np.pi+x[1]))) + x[2] - data
est_std, est_phase, est_mean = leastsq(optimize_func, [guess_std, guess_phase, guess_mean])[0]

# recreate the fitted curve using the optimized parameters
data_fit = est_mean + est_std*(.5 * (1 + np.cos((t/6)*np.pi + est_phase)))

plt.plot(data, '.')
plt.plot(data_fit, label='after fitting')
plt.plot(data_first_guess, label='first guess')
plt.legend()
plt.show()
print('max_demand: %s, phase_shift: %s, base_demand:%s' %(est_std, est_phase, est_mean))


max_demand: 210.422418112,/n phase_shift: 0.0416396205406, base_demand:1077.34831476

In [ ]: